Introduction to R

Master 1 Economy & Psychology — Paris-Cité — 2025-2026

October 17, 2025

1 Lecture organisation

We only have three hours, but we’ll make the most of them!

  • This course introduces R, which is both
    • A programming language
    • A piece of software
  • We only have three hours. Therefore, you will not be manipulating data in this class. Instead, the course provides you with a toolbox
    • Researchers mostly follow the same steps in each data analysis project: import data, clean and prepare data, describe data, analyse data, plot data, export the plots and results
    • The slides present the functions that are essential for the first three steps
    • (Free) online resources are provided for the last three steps
  • My aim is for you to grasp the fundamentals of R so that you can reuse the functions I demonstrate and train yourself using free online resources

2 The (very) basics of R

From Denise Earle on X

Why use R?

  • R is an important tool for many data analysts who use it to prepare, analyse and plot data, as well as create websites, reports and presentations (like this one, which was made entirely in R). But most importantly, it
    • is free
    • is open-source
    • allows anyone to reproduce your results (see, for instance, the Institute For Replication)
    • is very powerful and flexible
    • has a very engaged community that you can ask for help

R’s logic

  • R is basically a big fancy calculator

Very few people use base R, most use an Integrated Development Environment (IDE) such as RStudio

Tip

R is designed for scripts. Write your code in scripts, so that you can: (1) track everything you have done; (2) reproduce everything you have done; (3) share everything you have done.

  • Pro tip: the first time you use RStudio, go to Tools –> Global Options… and
    1. Untick the box Restore .RData into workspace at startup
    2. Set Save workspace to .RData on exit to Never
    • I really advise you to do this so that each time you open a new R session, you start with a brand new, empty environment. Imagine still having loaded data named expe1 and not understanding why the right columns are not showing? Been there, done that

R is an object-oriented language

  • R is essentially a big, fancy calculator
    • Do you want to find \(\hat{\boldsymbol{\beta}} = \underset{\boldsymbol{\beta}}{\arg\min} \; \sum_{i=1}^{n} \big(y_i - \mathbf{x}_i^\top \boldsymbol{\beta}\big)^2\) ? It can do that for you — if you know how to ask!
  • R works with objects. Objects can be (almost) anything
    • A single number or multiple numbers
    • A single character or character string (a word, multiple words, a sentence, multiple sentences, etc.)
    • A data frame comprising numbers and character strings
    • A plot
    • The output of a Generalised Random Forest
    • And, most importantly, all of the above, nested like Russian dolls 🪆
  • Combine a calculator with objects and you get the essence of R: a large calculator that manipulates objects
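
A minimal sketch of this nesting (all object names are invented for the example):

```r
# A number, a string, and a vector, each stored as its own object
n <- 42
s <- "penguin"
v <- c(1, 2, 3)

# Objects can be nested inside other objects, like Russian dolls
doll <- list(number = n, word = s, values = v)
doll$values[2]   # Access the second value stored inside the nested vector
[1] 2
```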

R is basically a big (fancy) calculator

# Addition
1 + 1 
[1] 2
# Subtraction
19 - 8
[1] 11
# Division
log(x = 98) / sqrt(x = 108)
[1] 0.4411887
# Multiplication
78^83 * (1 / log(x = sqrt(x = 32^3), base = exp(1)))
[1] 2.12797e+156

Tip

Anything that comes after a # in an R script is a comment that will never be run by R.

Comments are very useful for remembering why and how you wrote your script and for explaining it to yourself and others (think reproducibility).

R is basically a big (fancy) calculator that manipulates objects

adding_1 <- 1 + 1
subtract_1 <- 19 - 8
divide_1 <- log(x = 98, base = exp(1)) / sqrt(x = 108)
multiply_1 <- 78^83 * (1 / log(x = sqrt(x = 32^3), base = exp(1)))

adding_1 + subtract_1 / (divide_1 * multiply_1)
[1] 2

Notes

  1. The <- operator means that I am assigning what is on the right of the operator to the object that is on the left of the operator.

  2. When defining and manipulating objects, R (1) stores them in the Environment window; and (2) prints the used code and the answer in the Console window.

  3. Always use lowercase letters when naming your objects (R is case-sensitive) and avoid using diacritics or special characters. A good, albeit slightly long, name is lm_income_by_age_robustness_trimmed_1_99. This is a VERY BAD name: Predictor_âge_Sexe_robustness1_sansélèvesdéfaillants. Technically, it works. However, you’ll find yourself having to try very hard to write it correctly every time you need to.
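
Case sensitivity in action (the object names are invented for the example):

```r
# R is case-sensitive: these are two distinct objects
result <- 10
Result <- 20
result + Result   # 30; typing RESULT would throw an "object not found" error
exists("RESULT")  # FALSE: no object with that exact spelling
```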

Functions are a central element of R that build on objects

  • Functions are objects that
    • Take an input with arguments
    • Follow a series of instructions given arguments
    • Provide an output given arguments
  • For example, imagine that you want to take the square root of all the values in an object
x <- c(10, 4, 7, 12, 3, 18, 4, 12, 8) # Manually define an object x that takes multiple values

sqrt(x = x)
[1] 3.162278 2.000000 2.645751 3.464102 1.732051 4.242641 2.000000 3.464102
[9] 2.828427

Note

In this case, the function only takes one argument, x. Usually, a function takes multiple arguments (x, y, dataframe, etc.).

Functions are a central element of R that build on objects

  • Functions are objects that
    • Take an input with arguments
    • Follow a series of instructions given arguments
    • Provide an output given arguments
  • For example, imagine you want to find the lowest value in a given column, add its line number, and then multiply the result by the time of day. You could write a function to perform this task
x <- c(10, 4, 7, 12, 3, 18, 4, 12, 8) # Manually define an object x that takes multiple values

min_x <- function(x) {                          # Take the object x as an input
  i <- which.min(x)                             # Find the smallest value of x
  val <- x[i] + i                               # Add the line number to the smallest value
  val * as.numeric(format(Sys.time(), "%H"))    # Multiply the result by the hour of the day (0–23)
}

min_x(x = x)
[1] 104

Note

In this case, the function only takes one argument, x. Usually, a function takes multiple arguments (x, y, dataframe, etc.).
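
For instance, a sketch of a function with several arguments, two of which have default values (trim_mean is an invented name, not a base R function):

```r
# A function with one required argument (x) and two arguments with defaults
trim_mean <- function(x, trim = 0, na.rm = TRUE) {
  mean(x, trim = trim, na.rm = na.rm)
}

trim_mean(x = c(1, 2, 3, 100))               # defaults: plain mean, 26.5
trim_mean(x = c(1, 2, 3, 100), trim = 0.25)  # trim 25% from each side: 2.5
```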

Packages or libraries contain functions that perform calculations by manipulating objects

  • Packages are collections of functions, dataframes and documentation that come together
    • Some packages are built into R and are part of what is called base R
    • Some packages need to be downloaded. These are known as contributed packages and are extremely useful
  • Very few people work in base R. Those who do tend to do so because they
    • Need the full flexibility of R. For instance, Offer-Westort, Rosenzweig, and Athey (2024) wrote an adaptive assignment algorithm based on balanced linear Thompson sampling in base R, and their code can be accessed online
    • Need full control and reproducibility over their analyses. National statistics organisations such as INSEE mostly use base R because they cannot afford their statistics to be altered if an update to a package changes its calculation method
install.packages("tidyverse") # Install a package
library("tidyverse")          # Load    a package

Note

Simply installing a package is not enough: you also need to load it each time you launch a new session.

Find out how to get the most out of packages

  • Packages can be very powerful. Make the most of them!
    • At a minimum, packages come with a GitHub page; a quick Google search will find it. Unfortunately, GitHub pages contain varying amounts of information about the functions and their arguments
    • Here, the dplyr package

Find out how to get the most out of packages

  • Most package pages follow the same structure
    • Get started provides an introduction to the package, offering a summary of its key objectives and functions alongside relevant R code
    • Reference gathers all functions in one place
    • Articles shows how to perform the package’s main tasks, with accompanying R code
    • News is the changelog file that tracks all changes for every new release of the package

Packages or libraries contain functions that perform calculations by manipulating objects

By Albert Y. Kim on X

3 Load a dataframe

You are (almost) ready to load your first dataframe: a Comma-Separated Values (CSV) file

  • Most of the data you’ll use will come in one of a few formats
    • CSV or TXT. The formats psychologists use most: light, easy to share, and easy to import/export
    • XLSX. Excel’s format. Also easy to share and export, although somewhat heavier. It has no advantage over CSV for sharing a dataframe
    • SAV, SAS, or DTA. Respectively SPSS’s, SAS’s, and Stata’s formats. You’ll learn how to use them when you need to, using the haven package
    • PARQUET. A file format specifically designed for (very) large datasets (more information here). Again, you’ll learn how to use it when you need to, using the nanoparquet package
  • Why almost ready? Because we have not yet talked about the working directory
    • R needs to be told everything, including where to load the data from. My best advice is to work with R Projects
    • R Projects are .Rproj files that make your life much easier. Essentially, they tell R what your working directory is, so that you can work with a folder (and subfolders) for each of your data analysis projects

Work with R Projects

  • To define a new R Project, go to File –> New Project… and follow the steps

  • An alternative is to define the working directory manually. I am not fond of this approach, as it can be cumbersome and breaks with even minor changes to the file path
setwd("C:/Users/jaime/OneDrive/Documents/Emplois/Enseignements/Paris-Cité/2025 - Introduction à R/Support")

You are ready to load your first dataframe: a Comma-Separated Values (CSV) file

  • I have downloaded a dataframe and saved it in the folder Data that is in my working directory
df <- read.table(file = "Data/penguins.csv", 
                 header = TRUE, 
                 sep = ",", 
                 row.names = "X")

Note

Errors may occur when loading data: CSV is not a single standard. Some variants use commas , as column separators while others use semicolons ;, and some use periods . as decimal points while others use commas ,. Sometimes you need to dig in manually: the column separator may be a tab, or the file may contain row names that you did not specify (are you using a small x instead of a capital X?), and so on. It can be a bit cumbersome — we’ve all been there!
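
As a concrete illustration of the separator issue, base R ships read.csv2() for the semicolon/comma convention. A minimal sketch (the temporary file and column names are invented for the example):

```r
# European-style CSV files use ';' as separator and ',' as decimal point.
# Write a tiny example file, then read it back with read.csv2()
tmp <- tempfile(fileext = ".csv")
writeLines(c("id;weight", "1;3,75", "2;4,20"), tmp)

df_eu <- read.csv2(file = tmp)  # sep = ";" and dec = "," by default
df_eu$weight                    # parsed as numeric: 3.75 4.20
```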

Tip

You can make your life much easier by using File –> Import Dataset 😉

Before we go any further, let’s take a closer look at the read.table() function

df <- read.table(file = "Data/penguins.csv", 
                 header = TRUE, 
                 sep = ",", 
                 row.names = "X")
  • A function takes arguments
    • Learn more about a function by typing ?read.table in the console
    • The documentation displays the default values (e.g., header = FALSE). Arguments listed without a default indicate that a value is required
    • The file = part is not necessary, as long as you respect the order of the arguments
    • "" are necessary to indicate a string. Otherwise, R will consider that Data/penguins.csv is an object
    • Arguments expecting logical values can take either TRUE or T, and FALSE or F
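
Argument matching can be sketched with a simple base R function:

```r
# Named arguments can be given in any order; unnamed ones match by position
round(x = 3.14159, digits = 2)   # fully named
round(digits = 2, x = 3.14159)   # same result, order swapped
round(3.14159, 2)                # positional matching
[1] 3.14
```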

Check the importation of your first dataframe

  • colnames() returns the name of the columns (which are vectors)
colnames(x = df)
[1] "species"           "island"            "bill_length_mm"   
[4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
[7] "sex"               "year"             

Check the importation of your first dataframe

  • nrow() and ncol() respectively return the number of rows and of columns
nrow(x = df)
[1] 344
ncol(x = df)
[1] 8

Check the importation of your first dataframe

  • head() returns the first rows of an object
head(x = df)
  species    island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
1  Adelie Torgersen           39.1          18.7               181        3750
2  Adelie Torgersen           39.5          17.4               186        3800
3  Adelie Torgersen           40.3          18.0               195        3250
4  Adelie Torgersen             NA            NA                NA        <NA>
5  Adelie Torgersen           36.7          19.3               193        3450
6  Adelie Torgersen           39.3          20.6               190        3650
     sex year
1   male 2007
2 female 2007
3 female 2007
4   <NA> 2007
5 female 2007
6   male 2007

Check the importation of your first dataframe

  • str() returns the structure of the object
str(object = df)
'data.frame':   344 obs. of  8 variables:
 $ species          : chr  "Adelie" "Adelie" "Adelie" "Adelie" ...
 $ island           : chr  "Torgersen" "Torgersen" "Torgersen" "Torgersen" ...
 $ bill_length_mm   : num  39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
 $ bill_depth_mm    : num  18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
 $ flipper_length_mm: int  181 186 195 NA 193 190 181 195 193 190 ...
 $ body_mass_g      : chr  "3750" "3800" "3250" NA ...
 $ sex              : chr  "male" "female" "female" NA ...
 $ year             : int  2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
  • chr, num and int indicate the type of data (vectors) that the dataframe contains

Note

Can you spot the error(s)?

Check the importation of your first dataframe

  • summary() returns a summary of the object
summary(object = df)
   species             island          bill_length_mm  bill_depth_mm  
 Length:344         Length:344         Min.   :32.10   Min.   :13.10  
 Class :character   Class :character   1st Qu.:39.23   1st Qu.:15.60  
 Mode  :character   Mode  :character   Median :44.45   Median :17.30  
                                       Mean   :43.92   Mean   :17.15  
                                       3rd Qu.:48.50   3rd Qu.:18.70  
                                       Max.   :59.60   Max.   :21.50  
                                       NA's   :2       NA's   :2      
 flipper_length_mm body_mass_g            sex                 year     
 Min.   :172.0     Length:344         Length:344         Min.   :2007  
 1st Qu.:190.0     Class :character   Class :character   1st Qu.:2007  
 Median :197.0     Mode  :character   Mode  :character   Median :2008  
 Mean   :200.9                                           Mean   :2008  
 3rd Qu.:213.0                                           3rd Qu.:2009  
 Max.   :231.0                                           Max.   :2009  
 NA's   :2                                                             

Check the importation of your first dataframe

  • describe(), well… describes the object
library(psych)
describe(x = df)
                  vars   n    mean    sd  median trimmed   mad    min    max
species*             1 344    1.92  0.89    2.00    1.90  1.48    1.0    3.0
island*              2 344    1.66  0.73    2.00    1.58  1.48    1.0    3.0
bill_length_mm       3 342   43.92  5.46   44.45   43.91  7.04   32.1   59.6
bill_depth_mm        4 342   17.15  1.97   17.30   17.17  2.22   13.1   21.5
flipper_length_mm    5 342  200.92 14.06  197.00  200.34 16.31  172.0  231.0
body_mass_g*         6 342   44.87 24.67   42.00   44.19 29.65    1.0   94.0
sex*                 7 333    1.50  0.50    2.00    1.51  0.00    1.0    2.0
year                 8 344 2008.03  0.82 2008.00 2008.04  1.48 2007.0 2009.0
                  range  skew kurtosis   se
species*            2.0  0.16    -1.73 0.05
island*             2.0  0.61    -0.91 0.04
bill_length_mm     27.5  0.05    -0.89 0.30
bill_depth_mm       8.4 -0.14    -0.92 0.11
flipper_length_mm  59.0  0.34    -1.00 0.76
body_mass_g*       93.0  0.24    -1.06 1.33
sex*                1.0 -0.02    -2.01 0.03
year                2.0 -0.05    -1.51 0.04

Note

describe() treats non-numerical data as numerical and flags the converted columns with *. Be careful!

Check the importation of your first dataframe

  • datasummary_skim() describes the object, with a visual aid for numerical data
library(modelsummary)
datasummary_skim(data = df, type = "numeric")
                   Unique  Missing Pct.    Mean     SD     Min   Median     Max
bill_length_mm        165       1          43.9    5.5    32.1     44.5    59.6
bill_depth_mm          81       1          17.2    2.0    13.1     17.3    21.5
flipper_length_mm      56       1         200.9   14.1   172.0    197.0   231.0
year                    3       0        2008.0    0.8  2007.0   2008.0  2009.0
(The table also includes a Histogram column of inline plots, not reproduced here.)

Check the importation of your first dataframe

  • datasummary_skim() describes the object
datasummary_skim(data = df, type = "categorical")
                        N      %
species   Adelie      152   44.2
          Chinstrap    68   19.8
          Gentoo      124   36.0
island    Biscoe      168   48.8
          Dream       124   36.0
          Torgersen    52   15.1
sex       female      165   48.0
          male        168   48.8
          NA           11    3.2

Caution

Warning: These variables were omitted because they include more than 50 levels: body_mass_g.

Notes

If you cannot remember to which package a function belongs, for instance datasummary_skim(), you can type ??datasummary_skim() in the console. This will take a while, as R will search through every package you have ever installed for the function.

Check the importation of your first dataframe: The types of data

  • There are six different types of vector, each relating to a different kind of data
    • Numeric. Numbers with a decimal point (doubles)
    • Integer. Whole numbers (no decimal)
    • Logical. TRUE or FALSE data. Can also take the value NA, representing missing values
    • Complex. Complex numbers. We won’t cover it, my brain already hurts (\(i^2 = -1 \text{ ?!}\))
    • Character. String values, whether letters, words, sentences… Categorical characters can be stored as factors, which may be ordered (for instance, Never < Sometimes < Always is an ordered factor)
    • Raw. We won’t cover it, as it is not used 99.99% of the time
  • These types of data are stored in vectors, which are R’s fundamental data structure. When it comes to data, in R, everything is either a vector or an aggregate of vectors
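
A quick sketch of these types (the objects are invented for the example; class() reports how R stores each one):

```r
# One vector of each common type
num <- c(1.5, 2.7)                 # numeric (double)
int <- c(1L, 2L)                   # integer (the L suffix forces whole numbers)
lgl <- c(TRUE, FALSE, NA)          # logical, with a missing value
chr <- c("Never", "Sometimes", "Always")
fct <- factor(chr, levels = c("Never", "Sometimes", "Always"), ordered = TRUE)

class(num); class(int); class(lgl); class(chr); class(fct)
```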

Check the importation of your first dataframe: Modify the type of data

  • R is usually able to assign the correct data type to a vector. However, as you saw, it failed for the body_mass_g column, treating it as a character rather than a numeric value
    • Question: Why is this a problem?
    • Let’s reassign it. Remember when I said there are base R and contributed packages? I’ll show you two ways to do the same operation
# Base R
df$body_mass_g <- as.integer(x = df$body_mass_g)

str(object = df$body_mass_g)
 int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
# tidyverse
df <- df %>%
  mutate(body_mass_g = as.integer(body_mass_g))
str(object = df$body_mass_g)
 int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...

Notes

  1. Use the $ sign to access vectors stored in objects, such as a dataframe. This method is not 100% reliable, depending on the structure of the object, but you will learn as you go along.

  2. Most of the time, you don’t need to work in base R to manipulate data. You will learn to use the tidyverse packages (Wickham et al. 2019), which will make your data manipulation much easier.

summary(object = df$body_mass_g)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   2700    3550    4050    4202    4750    6300       2 

4 How to manipulate data: The tidyverse and the pipe operator %>%

The tidyverse is a suite of packages intended to make data manipulation much easier

  • Again, everything can be done in base R, but some manipulations are highly cumbersome there. To ease data wrangling, a suite of packages (and functions) was created under the name tidyverse. It is built around tidy data, where
    1. Each variable forms a column
    2. Each observation forms a row (not a person: an observation)
    3. Each type of observational unit forms a table
install.packages("tidyverse") # Install a package
library("tidyverse")          # Load    a package
  • The tidyverse
    • Takes in tidy data
    • Uses well-named functions that reflect their purpose: for example, rename() renames columns and filter() keeps the rows that meet given conditions
    • Shares a consistent syntax

Before manipulating data: The pipe operator %>%

  • R evaluates nested expressions from the inside out, which can quickly become unreadable. To avoid this, a pipe operator was created
  • Base R
mean(x = 
  sqrt(x = 
    abs(x = 
      log(x = 
        as.numeric(x = 
          factor(x = c(-10, 0, 10, 100), 
                 levels = c(-10, 0, 10, 100))
        )
      )
    )
  )
)
[1] 0.7645279
  • Using the pipe operator
c(-10, 0, 10, 100) %>%                         # Create a vector
  factor(levels = c(-10, 0, 10, 100)) %>%      # convert to factor
  as.numeric() %>%                             # factor → numeric
  log() %>%                                    # take log
  abs() %>%                                    # absolute value
  sqrt() %>%                                   # square root
  mean()                                       # mean of the result
[1] 0.7645279
  • The %>% operator takes the object on its left side and makes it the first argument of the function on its right side. Put differently, the code now reads linearly!
  • Another pipe operator |> also exists. You can find a summary of the main differences here. Personally, I use the %>% operator out of habit, and also because it is easier to use with an AZERTY keyboard
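
A minimal sketch of the same idea with the native |> operator (available since R 4.1):

```r
# The native pipe |> behaves the same way for simple first-argument piping
c(4, 16, 25) |>
  sqrt() |>   # 2, 4, 5
  mean()      # 11/3
```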

Rename vectors (columns in a dataframe)

  • Let’s translate a column to French 🥖
# Base R
names(df)[names(df) == "species"] <- "Espèce"
# tidyverse
df <- df %>% 
  rename("Espèce" = "species")

A tidyverse workflow, taken from Wickham, Cetinkaya-Rundel, and Grolemund (2023).

Create a new vector (column in a dataframe), based on an existing one

  • For example, what if I prefer to count in pounds rather than grams?
# Base R
df$body_mass_pounds <- df$body_mass_g / 453.59237
# tidyverse
df <- df %>% 
  mutate(body_mass_pounds = body_mass_g / 453.59237)
  • Get the mean and the standard deviation of this new column
mean(x = df$body_mass_pounds)
[1] NA
sd(x = df$body_mass_pounds)
[1] NA
  • R needs to be told what to do with missing values (NAs)
mean(x = df$body_mass_pounds, na.rm = TRUE)
[1] 9.263283
sd(x = df$body_mass_pounds, na.rm = TRUE)
[1] 1.768007
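
The NA behaviour can be sketched on a toy vector:

```r
# Check how many values are missing before summarising
x <- c(10, NA, 30, NA, 50)
sum(is.na(x))           # 2 missing values
mean(x)                 # NA: missingness propagates by default
mean(x, na.rm = TRUE)   # 30: NAs removed before computing
```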

Filter a dataframe, according to a condition in a vector (column)

df %>%
  filter(body_mass_g >= mean(body_mass_g)) %>%
  nrow()
[1] 0
df %>%
  filter(body_mass_g >= mean(body_mass_g, 
                             na.rm = TRUE)) %>%
  nrow()
[1] 149

Pivot a dataframe: 1/2

  • We sometimes need to pivot data in order to run a repeated measures analysis

 id   group     before     after
  1   control   40.48391   42.24692
  2   control   52.06518   54.77531
  3   control   53.46311   40.54112
  4   control   52.94401   53.91534
  5   control   53.86674   51.82061
  6   test      45.80264   52.06510
  7   test      54.11571   57.03168
  8   test      49.76629   66.47595
  9   test      56.28729   56.44203
 10   test      53.39197   64.62119

Pivot a dataframe: 2/2

  • We sometimes need to pivot data in order to run a repeated measures analysis
df_wide %>%
  pivot_longer(
    cols = c(before,
             after),
    names_to = "time",
    values_to = "value")

Note

A pivot_wider() function is also available, depending on the required pivoting.

 id   group     time     value
  1   control   before   40.48391
  1   control   after    42.24692
  2   control   before   52.06518
  2   control   after    54.77531
  3   control   before   53.46311
  3   control   after    40.54112
  4   control   before   52.94401
  4   control   after    53.91534
  5   control   before   53.86674
  5   control   after    51.82061
  6   test      before   45.80264
  6   test      after    52.06510
  7   test      before   54.11571
  7   test      after    57.03168
  8   test      before   49.76629
  8   test      after    66.47595
  9   test      before   56.28729
  9   test      after    56.44203
 10   test      before   53.39197
 10   test      after    64.62119
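
Going the other way, pivot_wider() spreads the time column back into one column per time point. A minimal sketch on invented toy data:

```r
library(tidyr)

# Toy long-format data (illustrative values)
df_long <- data.frame(
  id    = c(1, 1, 2, 2),
  group = c("control", "control", "test", "test"),
  time  = c("before", "after", "before", "after"),
  value = c(40.5, 42.2, 45.8, 52.1)
)

# Reverse of pivot_longer(): one column per level of `time`
df_back <- pivot_wider(df_long, names_from = time, values_from = value)
df_back   # columns: id, group, before, after
```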

5 Summarise data

The base R functions table() and prop.table()

  • R has a built-in table() function. It won’t get you very far, but it is sometimes enough to quickly skim through your data (check how many categories a column has; whether some categories need to be merged together; the number of observations per category…)
df %>%
  select(sex, species) %>%
  table()
        species
sex      Adelie Chinstrap Gentoo
  female     73        34     58
  male       73        34     61
  • You can get the percentages by adding the prop.table() function
df %>%
  select(sex, species) %>%
  table() %>%
  prop.table() * 100
        species
sex        Adelie Chinstrap   Gentoo
  female 21.92192  10.21021 17.41742
  male   21.92192  10.21021 18.31832

Careful!

We did not check for missing data thoroughly enough. Reporting these tables without further exploring whether (and how much) data is missing would be a bold move.
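
One quick way to explore missingness before reporting such tables, sketched on invented toy data:

```r
# Count the missing values per column before building any cross-table
toy <- data.frame(
  sex     = c("male", "female", NA, "male"),
  species = c("Adelie", "Adelie", "Gentoo", NA)
)

colSums(is.na(toy))                            # NAs per column: 1 and 1
table(toy$sex, toy$species, useNA = "ifany")   # keep NAs visible in the table
```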

Summarise data, the tough way

df %>%

  summarize(
    mean = mean(body_mass_g),
    sd = sd(body_mass_g),
    n = n(),
    se = sd / sqrt(n),
    confint_low = mean - 1.96 * se,
    confint_high = mean + 1.96 * se,
    .by = sex)
     sex     mean       sd   n       se confint_low confint_high
1   male 4545.685 787.6289 168 60.76689    4426.581     4664.788
2 female 3862.273 666.1720 165 51.86142    3760.624     3963.921
3   <NA>       NA       NA  11       NA          NA           NA
df %>%
  drop_na() %>%
  summarize(
    mean = mean(body_mass_g),
    sd = sd(body_mass_g),
    n = n(),
    se = sd / sqrt(n),
    confint_low = mean - 1.96 * se,
    confint_high = mean + 1.96 * se,
    .by = sex)
     sex     mean       sd   n       se confint_low confint_high
1   male 4545.685 787.6289 168 60.76689    4426.581     4664.788
2 female 3862.273 666.1720 165 51.86142    3760.624     3963.921
  • In this case, applying the drop_na() function to the entire dataset does not alter the results. However, most of the time it will. It would be better to do the following. Why?
df %>%
  select(body_mass_g, sex) %>%
  drop_na() %>%
  summarize(
    mean = mean(body_mass_g),
    sd = sd(body_mass_g),
    n = n(),
    se = sd / sqrt(n),
    confint_low = mean - 1.96 * se,
    confint_high = mean + 1.96 * se,
    .by = sex)
     sex     mean       sd   n       se confint_low confint_high
1   male 4545.685 787.6289 168 60.76689    4426.581     4664.788
2 female 3862.273 666.1720 165 51.86142    3760.624     3963.921

Summarise data, the tough way, but make it pretty(ish) using the flextable package 1/2

Sex      Mean body mass (g)   sd       Number of penguins   se      95% confidence interval [lower]   95% confidence interval [upper]
Male     4,545.68             787.63   168                  60.77   4,426.58                          4,664.79
Female   3,862.27             666.17   165                  51.86   3,760.62                          3,963.92

Summarise data, the tough way, but make it pretty(ish) using the flextable package 2/2

# Install the package and load it
install.packages("flextable")
library("flextable")

# Construct the table
df %>%
  select(body_mass_g, sex) %>%
  drop_na() %>%
  summarize(
    mean = mean(body_mass_g),
    sd = sd(body_mass_g),
    n = n(),
    se = sd / sqrt(n),
    confint_low = mean - 1.96 * se,
    confint_high = mean + 1.96 * se,
    .by = sex) %>%
  
  rename("Sex" = "sex",
         "Mean body mass (g)" = "mean",
         "Number of penguins" = "n",
         "95% confidence interval [lower]" = "confint_low",
         "95% confidence interval [upper]" = "confint_high") %>%
  
  mutate(across(where(is.numeric), ~ round(.x, 2)),
         Sex = ifelse(test = Sex == "male", yes = "Male", no = "Female")) %>%

# Format the table
  flextable() %>%
  bold(part = "header") %>%
  autofit() # %>%
# save_as_docx(path = "/Tables/Penguins_table_1.docx") # Or save_as_html()

Summarise data, the tough way, and export it to LaTeX using the knitr and kableExtra packages

library("knitr")
library("kableExtra")

df %>%
  select(body_mass_g, sex) %>%
  drop_na() %>%
  summarize(
    mean = mean(body_mass_g),
    sd = sd(body_mass_g),
    n = n(),
    se = sd / sqrt(n),
    confint_low = mean - 1.96 * se,
    confint_high = mean + 1.96 * se,
    .by = sex) %>%
  rename("Sex" = "sex",
         "Mean body mass (g)" = "mean",
         "Number of penguins" = "n",
         "95% confidence interval [lower]" = "confint_low",
         "95% confidence interval [upper]" = "confint_high") %>%
  mutate(across(where(is.numeric), ~ round(.x, 2)),
         Sex = ifelse(Sex == "male", "Male", "Female")) %>%
  
  kable(format = "latex", booktabs = TRUE, align = "c") %>%
  kable_styling(latex_options = c("hold_position", "striped")) %>%
  save_kable("Tables/Penguins_table_1.tex")

Summarise data the easier way, using the gtsummary package

library(gtsummary)
theme_gtsummary_compact()
df %>%
  select(sex, species) %>%
  
  tbl_summary(by = "species", 
              percent = "cell", 
              statistic = all_categorical() ~ "{p}%")
Characteristic   Adelie, N = 152¹   Chinstrap, N = 68¹   Gentoo, N = 124¹
sex
    female       22%                10%                  17%
    male         22%                10%                  18%
    Unknown      6                  0                    5
¹ %
df %>%
  select(sex, species) %>%
  drop_na() %>%
  tbl_summary(by = "species", 
              percent = "cell", 
              statistic = all_categorical() ~ "{p}%")
Characteristic   Adelie, N = 146¹   Chinstrap, N = 68¹   Gentoo, N = 119¹
sex
    female       22%                10%                  17%
    male         22%                10%                  18%
¹ %

Careful!

We were right to be careful: 11 observations had missing data for penguins’ sex!

Combine multiple data wrangling functions at once, to summarise data

df %>%
  select(body_mass_g, sex, species, bill_length_mm, bill_depth_mm, year) %>%
  mutate(body_mass_pounds = body_mass_g / 453.59237,
         sex = case_when(sex == "female" ~ "Femelle",
                         sex == "male"   ~ "Mâle")) %>%
  rename("Espèce" = "species",
         "企鵝喙的長度(單位:毫米)" = "bill_length_mm",
         "عمق المنقار (مم) )" = "bill_depth_mm") %>%
  filter(body_mass_pounds >= (mean(body_mass_pounds, na.rm = TRUE) + sd(body_mass_pounds, na.rm = TRUE))) %>%
  drop_na() %>%
  
  tbl_summary(by = "sex", 
              statistic = list(all_continuous() ~ "{mean} ({sd})"))
Characteristic              Femelle, N = 5¹   Mâle, N = 56¹
Espèce
    Gentoo                  5 (100%)          56 (100%)
island
    Biscoe                  5 (100%)          56 (100%)
企鵝喙的長度(單位:毫米)     46.16 (1.76)      49.59 (2.76)
عمق المنقار (مم)            14.44 (0.65)      15.76 (0.75)
flipper_length_mm           214 (5)           222 (6)
body_mass_g                 5,140 (65)        5,534 (276)
year
    2007                    1 (20%)           17 (30%)
    2008                    3 (60%)           19 (34%)
    2009                    1 (20%)           20 (36%)
body_mass_pounds            11.33 (0.14)      12.20 (0.61)
¹ n (%); Mean (SD)

6 Do we have time for linear models?

If we have the time, we’ll fit some linear models. If not, that will be a bonus for you at home

  • You might be familiar with good old \(Y_i = \alpha + \beta X_i + \varepsilon_i\), also called the regression by its friends. Let’s fit it, with one, two, and three predictors, with interactions, and have a look at the results

\[ \begin{align} \text{Model 1: } Y_i &= \alpha + \varepsilon_i \\ \text{Model 2: } Y_i &= \alpha + \beta \, X_{i,\text{flipper length}} + \varepsilon_i \\ \text{Model 3: } Y_i &= \alpha + \beta \, X_{i,\text{flipper length}} + \lambda \, X_{i,\text{bill depth}} + \varepsilon_i \\ \text{Model 4: } Y_i &= \alpha + \beta \, X_{i,\text{flipper length}} + \lambda \, X_{i,\text{bill depth}} + \gamma \, (X_{i,\text{flipper length}} \times X_{i,\text{bill depth}}) + \varepsilon_i \end{align} \]

lm1 <- lm(formula = body_mass_g ~ 1, data = df)
lm2 <- lm(body_mass_g ~ 1 + flipper_length_mm, df)
lm3 <- lm(body_mass_g ~ 1 + flipper_length_mm + bill_depth_mm, df)
lm4 <- lm(body_mass_g ~ 1 + flipper_length_mm + bill_depth_mm + flipper_length_mm:bill_depth_mm, df)

Check the results using the base R summary() function

summary(lm1)

Call:
lm(formula = body_mass_g ~ 1, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-1501.8  -651.8  -151.8   548.2  2098.2 

Coefficients:
            Estimate Std. Error t value            Pr(>|t|)    
(Intercept)  4201.75      43.36   96.89 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 802 on 341 degrees of freedom
  (2 observations deleted due to missingness)

Check the results using the base R summary() function

summary(lm2)

Call:
lm(formula = body_mass_g ~ 1 + flipper_length_mm, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-1058.80  -259.27   -26.88   247.33  1288.69 

Coefficients:
                   Estimate Std. Error t value            Pr(>|t|)    
(Intercept)       -5780.831    305.815  -18.90 <0.0000000000000002 ***
flipper_length_mm    49.686      1.518   32.72 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 394.3 on 340 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.759, Adjusted R-squared:  0.7583 
F-statistic:  1071 on 1 and 340 DF,  p-value: < 0.00000000000000022

Check the results using the base R summary() function

summary(lm3)

Call:
lm(formula = body_mass_g ~ 1 + flipper_length_mm + bill_depth_mm, 
    data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-1029.78  -271.45   -23.58   245.15  1275.97 

Coefficients:
                   Estimate Std. Error t value            Pr(>|t|)    
(Intercept)       -6541.907    540.751 -12.098 <0.0000000000000002 ***
flipper_length_mm    51.541      1.865  27.635 <0.0000000000000002 ***
bill_depth_mm        22.634     13.280   1.704              0.0892 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 393.2 on 339 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.761, Adjusted R-squared:  0.7596 
F-statistic: 539.8 on 2 and 339 DF,  p-value: < 0.00000000000000022

Check the results using the base R summary() function

summary(lm4)

Call:
lm(formula = body_mass_g ~ 1 + flipper_length_mm + bill_depth_mm + 
    flipper_length_mm:bill_depth_mm, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-938.88 -253.96  -28.25  220.66 1048.33 

Coefficients:
                                  Estimate Std. Error t value
(Intercept)                     -36097.064   4636.271  -7.786
flipper_length_mm                  196.074     22.603   8.675
bill_depth_mm                     1771.796    273.003   6.490
flipper_length_mm:bill_depth_mm     -8.596      1.340  -6.414
                                            Pr(>|t|)    
(Intercept)                       0.0000000000000855 ***
flipper_length_mm               < 0.0000000000000002 ***
bill_depth_mm                     0.0000000003057955 ***
flipper_length_mm:bill_depth_mm   0.0000000004782702 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 371.8 on 338 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.787, Adjusted R-squared:  0.7851 
F-statistic: 416.2 on 3 and 338 DF,  p-value: < 0.00000000000000022
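Since Models 1 to 4 are nested, they can also be compared directly with F-tests using base R’s `anova()`. A sketch on the built-in `mtcars` data (the slides’ `lm1` to `lm4` objects would work exactly the same way):

```r
# Four nested models of fuel consumption, from intercept-only to interaction
m1 <- lm(mpg ~ 1, data = mtcars)
m2 <- lm(mpg ~ 1 + wt, data = mtcars)
m3 <- lm(mpg ~ 1 + wt + hp, data = mtcars)
m4 <- lm(mpg ~ 1 + wt + hp + wt:hp, data = mtcars)

# Each row tests whether the added term significantly reduces the residual SS
anova(m1, m2, m3, m4)
```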

Check the results using the modelsummary package

library("modelsummary")

list_lm <- list(
  "Model 1" = lm1,
  "Model 2" = lm2,
  "Model 3" = lm3,
  "Model 4" = lm4)

modelsummary(list_lm,
             stars = TRUE,
             fmt = 2,
             vcov = "HC3",
             title = "Table 1: An example of a couple of different linear models.",
             notes = "Notes. Y = Body mass (in grams). Robust standard errors 'HC3'. Noice, right!?",
             coef_rename = c(
               "flipper_length_mm" = "Flipper length (mm)",
               "bill_depth_mm" = "Bill depth (mm)",
               "flipper_length_mm:bill_depth_mm" = "Flipper length (mm) X Bill depth (mm)"), 
             gof_map = c("nobs", "adj.r.squared"))

Check the results using the modelsummary package

Table 1: An example of a couple of different linear models.
                                          Model 1      Model 2      Model 3       Model 4
(Intercept)                            4201.75***  -5780.83***  -6541.91***  -36097.06***
                                          (43.43)     (294.88)     (500.35)     (4389.62)
Flipper length (mm)                                    49.69***     51.54***     196.07***
                                                        (1.45)       (1.75)       (21.46)
Bill depth (mm)                                                      22.63+      1771.80***
                                                                    (12.18)      (259.54)
Flipper length (mm) X Bill depth (mm)                                              -8.60***
                                                                                   (1.28)
Num.Obs.                                      342          342          342           342
R2 Adj.                                     0.000        0.758        0.760         0.785
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
Notes. Y = Body mass (in grams). Robust standard errors 'HC3'. Noice, right!?
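With an interaction term, the coefficient on flipper length no longer has a single interpretation: its effect depends on bill depth. One way to make sense of it is to compute predictions over a grid of predictor values with base R’s `predict()`. A sketch on `mtcars` (the same approach applies to `lm4`):

```r
# An interaction model: the effect of wt on mpg depends on hp
m <- lm(mpg ~ wt * hp, data = mtcars)

# Predict over a small grid of predictor values
grid <- expand.grid(wt = c(2, 3, 4), hp = c(100, 200))
cbind(grid, predicted_mpg = predict(m, newdata = grid))
```

The marginaleffects package (see the references) automates this kind of computation.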

7 Advice before we leave 😭

From Peter Peregrine’s blog

Functions and packages you’ll someday need

R is highly logical: think, ask, look for answers on forums!

  • What should you do if your code isn’t working?
  • What to do when you do not know how to do what you are aiming to do?
    • R is a highly logical language. The first thing to do is ask yourself: (1) Where am I? (2) Where do I want to go, very precisely? (3) What are the logical steps to get from (1) to (2)? Doing so breaks the problem down into small bits, which are much more manageable
    • Finally, the vast majority of your questions have already been answered online. Try Stack Overflow, which is dedicated to programming and has a section specifically for R. You can also try Cross Validated, which is dedicated to statistics. Look for answers before asking a new question!

Example of a Cheat sheet, for the dplyr package 1/2

Example of a Cheat sheet, for the dplyr package 2/2

8 (Free) online resources

9 References

Arel-Bundock, Vincent, Noah Greifer, and Andrew Heiss. 2024. “How to Interpret Statistical Models Using marginaleffects for R and Python.” Journal of Statistical Software 111 (9): 1–32. https://doi.org/10.18637/jss.v111.i09.
Barnier, Julien. 2024. Introduction à R et au tidyverse.
Bates, Douglas, Martin Mächler, Ben Bolker, and Steve Walker. 2015. “Fitting Linear Mixed-Effects Models Using Lme4.” Journal of Statistical Software 67 (1): 1–48. https://doi.org/10.18637/jss.v067.i01.
Bergé, Laurent. 2018. “Efficient Estimation of Maximum Likelihood Models with Multiple Fixed-Effects: The R Package FENmlm.” CREA Discussion Papers, no. 13.
Croissant, Yves, and Giovanni Millo. 2008. “Panel Data Econometrics in R: The plm Package.” Journal of Statistical Software 27 (2): 1–43. https://doi.org/10.18637/jss.v027.i02.
Horst, Allison Marie, Alison Presmanes Hill, and Kristen B Gorman. 2020. Palmerpenguins: Palmer Archipelago (Antarctica) Penguin Data. https://doi.org/10.5281/zenodo.3960218.
Kuznetsova, Alexandra, Per B. Brockhoff, and Rune H. B. Christensen. 2017. “lmerTest Package: Tests in Linear Mixed Effects Models.” Journal of Statistical Software 82 (13): 1–26. https://doi.org/10.18637/jss.v082.i13.
Larmarange, Joseph. 2024. Introduction à l’analyse d’enquêtes avec R et RStudio.
Lüdecke, Daniel, Mattan S Ben-Shachar, Indrajeet Patil, Philip Waggoner, and Dominique Makowski. 2021. “performance: An R Package for Assessment, Comparison and Testing of Statistical Models.” Journal of Open Source Software 6 (60).
Lumley, Thomas. 2004. “Analysis of Complex Survey Samples.” Journal of Statistical Software 9 (1): 1–19.
Offer-Westort, Molly, Leah R Rosenzweig, and Susan Athey. 2024. “Battling the Coronavirus ‘Infodemic’ Among Social Media Users in Kenya and Nigeria.” Nature Human Behaviour 8 (5): 823–34.
Rohrer, Julia M, and Vincent Arel-Bundock. 2025. “Models as Prediction Machines: How to Convert Confusing Coefficients into Clear Quantities.” https://doi.org/10.31234/osf.io/g4s2a_v1.
Rosseel, Yves. 2012. “lavaan: An R Package for Structural Equation Modeling.” Journal of Statistical Software 48 (2): 1–36. https://doi.org/10.18637/jss.v048.i02.
Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59: 1–23.
———. 2016. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. https://ggplot2.tidyverse.org.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the Tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.
Wickham, Hadley, Mine Cetinkaya-Rundel, and Garrett Grolemund. 2023. R for Data Science. 2nd ed. O’Reilly Media.
Zeileis, Achim, Susanne Köll, and Nathaniel Graham. 2020. “Various Versatile Variances: An Object-Oriented Implementation of Clustered Covariances in R.” Journal of Statistical Software 95 (1): 1–36. https://doi.org/10.18637/jss.v095.i01.